The right approach is highly dependent on the nature of the missing data.
Dealing with missing data requires understanding why it is missing first!
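To make the point concrete, here is a minimal simulation sketch of the three missingness mechanisms (MCAR, MAR, MNAR). The variables `age` and `income`, and the missingness probabilities, are invented for illustration and are not taken from the attrition data.

```r
set.seed(42)
n <- 1000

# Hypothetical data: income depends on age
age <- rnorm(n, mean = 40, sd = 10)
income <- 2000 + 50 * age + rnorm(n, sd = 500)

# MCAR: every income value has the same chance of being missing
mcar <- runif(n) < 0.3
# MAR: the chance of missing income depends only on the observed age
mar <- runif(n) < plogis((age - 40) / 5)
# MNAR: the chance of missing income depends on income itself
mnar <- runif(n) < plogis((income - mean(income)) / 500)

# Mean of the income values that remain observed under each mechanism
observed_means <- sapply(
  list(full = rep(FALSE, n), MCAR = mcar, MAR = mar, MNAR = mnar),
  function(miss) mean(income[!miss])
)
observed_means
```

Under MCAR the observed mean should stay close to the full-data mean, while under the MAR and MNAR mechanisms simulated here the higher incomes are more likely to be missing, so the observed mean is pulled downwards.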
Deletion Methods
Listwise Deletion
Removing any rows that contain missing values for any relevant variables.
Analysis carried out on complete cases only.
Pairwise Deletion
Excluding a case only from the calculations that involve a variable for which it has a missing value.
Each calculation is carried out on all cases that are complete for the variables it uses.
Listwise deletion can be appropriate when data is missing completely at random, and the volume of missing values is not itself problematic.
Pairwise deletion is less common, but can also work when data is MCAR, and assumes that data is approximately normally distributed.
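The difference between the two deletion methods can be sketched with base R. This example uses the built-in `airquality` data rather than the attrition data, purely because it ships with R and already contains missing values.

```r
# Ozone and Solar.R contain missing values; Wind and Temp are complete
vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Listwise deletion: drop every row with a missing value in any variable
listwise <- na.omit(vars)
cor_listwise <- cor(listwise)

# Pairwise deletion: each correlation uses all rows complete for that pair
cor_pairwise <- cor(vars, use = "pairwise.complete.obs")

# Number of rows that each pairwise correlation is based on
observed <- 1 * (!is.na(vars))
n_pairwise <- crossprod(observed)

nrow(vars)      # rows before deletion
nrow(listwise)  # rows left after listwise deletion
n_pairwise      # rows available for each pair of variables
```

Pairwise deletion keeps more data for each individual correlation, but the resulting matrix is built from different subsets of rows, which is one reason it is used less often.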
Imputation Methods
Imputation involves replacing missing values with values inferred from the rest of the data.
There are two types of imputation:
Single Imputation - Replacing missing values with a single value, estimated from some statistical procedure.
Multiple Imputation - Creating multiple datasets each replacing missing values with plausible estimated values and pooling estimates from analyses carried out on each dataset.
Mean Imputation
The simplest statistical procedure for imputing missing values is replacing all missing values with the variable’s average value.
This is almost always a bad idea, because it has multiple flaws:
Distorts the variable’s distribution, underestimating its variance (Van Buuren 2018).
Disrupts the relationship between the variable with imputed values and all other variables (Nguyen 2020).
plot_regression <- function(data) {
  data |>
    ggplot(aes(x = total_working_years, y = monthly_income)) +
    geom_point(shape = 21, size = 1.5, alpha = .5) +
    geom_smooth(
      method = lm, colour = "#005EB8",
      fill = "#005EB8", alpha = .5
    ) +
    labs(
      x = "Total Working Years", y = "Monthly Income",
      title = "Monthly Income ~ Total Length of Career"
    )
}

attrition |>
  plot_regression()
Mean Imputation
set.seed(123)

missing_years <- attrition |>
  mutate(
    total_working_years = replace(
      total_working_years,
      runif(n()) < 0.8 & (job_level <= 2 | age > 35),
      NA
    )
  )

missing_years |>
  mice::mice(
    method = "mean", m = 1,
    maxit = 1, print = FALSE
  ) |>
  mice::complete() |>
  plot_regression()
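The two flaws of mean imputation can also be seen numerically. The following is a small base R sketch on simulated data (the variables `x` and `y` are hypothetical, not columns from the attrition data), with values removed completely at random.

```r
set.seed(123)
n <- 500

# Hypothetical data: y depends linearly on x
x <- rnorm(n, mean = 10, sd = 3)
y <- 2 * x + rnorm(n, sd = 3)

# Remove 40% of x completely at random
x_missing <- replace(x, sample(n, size = 0.4 * n), NA)

# Mean imputation: fill every gap with the mean of the observed values
x_imputed <- replace(x_missing, is.na(x_missing), mean(x_missing, na.rm = TRUE))

c(
  var_original = var(x),
  var_imputed  = var(x_imputed),
  cor_original = cor(x, y),
  cor_imputed  = cor(x_imputed, y)
)
```

Because every imputed value sits exactly at the mean, the variance of the imputed variable is always smaller than the variance of the observed values, and the correlation with `y` will typically be weakened as well.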
Regression Imputation
A more robust approach to single imputation is to estimate missing values using a predictive model of the variable in question, using the rest of the variables in the dataset.
Regression imputation can rely on a variety of models, based on the type of data being imputed and how complex the model should be.
Imputing one value for a missing datum cannot be correct in general, because we don’t know what value to impute with certainty (if we did, it wouldn’t be missing) (Rubin 1987).
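A minimal sketch of regression imputation in base R, assuming a simple linear model is adequate. It uses the built-in `airquality` data for convenience, and the choice of `Temp` and `Wind` as predictors is illustrative only.

```r
dat <- airquality[, c("Ozone", "Temp", "Wind")]
miss <- is.na(dat$Ozone)

# Fit the imputation model; lm() uses only rows where Ozone is observed
imp_model <- lm(Ozone ~ Temp + Wind, data = dat)

# Replace each missing Ozone value with its predicted value
predicted <- predict(imp_model, newdata = dat[miss, ])
dat_imputed <- dat
dat_imputed$Ozone[miss] <- predicted

# Compare the spread of the observed values with the spread of the imputed values
c(
  sd_observed = sd(dat$Ozone, na.rm = TRUE),
  sd_imputed  = sd(predicted)
)
```

Every imputed value lies exactly on the fitted regression surface, so the imputed data carry no residual noise. This is the sense in which a single imputed value overstates how much is known about the missing datum.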
Multiple Imputation
Multiple imputation involves generating multiple datasets, performing analysis on each, and pooling the results. This is a two-stage process:
Generate multiple completed datasets, filling missing values using a statistical model that estimates imputation values, plus a random component to capture the uncertainty in the estimate.
Compute estimates on each completed dataset, then combine them into pooled estimates and standard errors using Rubin’s (1987) rules (Murray 2018).
The methods used for each stage may differ, but this two-stage approach is generally consistent across all forms of multiple imputation.
This approach acknowledges the uncertainty in the imputation of missing values and builds that uncertainty into the process, instead of treating imputed values as if they were as certain as observed values.
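The two stages can be sketched by hand in base R. This is a simplified illustration, not how {mice} implements its methods: it assumes a linear imputation model refitted on bootstrap samples, adds random residual noise to each imputed value, and uses the built-in `airquality` data. The pooling step follows Rubin's rules, where the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```r
set.seed(123)
m <- 20
dat <- airquality[, c("Ozone", "Temp", "Wind")]
miss <- is.na(dat$Ozone)
observed_rows <- dat[!miss, ]

# Stage 1: generate m completed datasets and fit the analysis model on each
fits <- lapply(seq_len(m), function(i) {
  # Refit the imputation model on a bootstrap sample to reflect parameter uncertainty
  boot <- observed_rows[sample(nrow(observed_rows), replace = TRUE), ]
  imp_model <- lm(Ozone ~ Temp + Wind, data = boot)

  # Impute predictions plus random noise to reflect residual uncertainty
  completed <- dat
  completed$Ozone[miss] <- predict(imp_model, newdata = dat[miss, ]) +
    rnorm(sum(miss), sd = sigma(imp_model))

  lm(Ozone ~ Temp, data = completed)
})

# Stage 2: pool the Temp coefficient across the m analyses
est <- sapply(fits, function(f) coef(f)[["Temp"]])
within <- sapply(fits, function(f) vcov(f)["Temp", "Temp"])

pooled_est  <- mean(est)                            # pooled point estimate
within_var  <- mean(within)                         # average within-imputation variance
between_var <- var(est)                             # between-imputation variance
total_var   <- within_var + (1 + 1 / m) * between_var

c(estimate = pooled_est, std_error = sqrt(total_var))
```

The pooled standard error is never smaller than the average standard error from a single completed dataset, which is how the extra uncertainty from imputation enters the final result. In practice, `mice::mice()` and `mice::pool()` take care of both stages.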
set.seed(123)

missing_income <- attrition |>
  mutate(
    monthly_income = replace(
      monthly_income,
      runif(n()) < 0.8 & (job_level <= 2 | total_working_years > 10),
      NA
    ),
    total_working_years = replace(
      total_working_years,
      runif(n()) < 0.5 & job_level >= 3,
      NA
    )
  )

get_pooled_estimates <- function(data, method, m, maxit) {
  data |>
    mice::mice(method = method, m = m, maxit = maxit, print = FALSE, seed = 123) |>
    with(glm(factor(attrition) ~ arm::rescale(monthly_income) + total_working_years, family = "binomial")) |>
    mice::pool()
}

no_imp <- glm(
  factor(attrition) ~ arm::rescale(monthly_income) + total_working_years,
  family = "binomial", data = attrition
)

mean_imp <- missing_income |>
  get_pooled_estimates(method = "mean", m = 1, maxit = 1)

norm_imp <- missing_income |>
  get_pooled_estimates(method = "norm.predict", m = 1, maxit = 1)

pmm_imp <- missing_income |>
  get_pooled_estimates(method = "pmm", m = 50, maxit = 20)
Regression Estimates
models <- list(
  "No Imputation" = no_imp,
  "Mean" = mean_imp,
  "Regression" = norm_imp,
  "Predictive Mean Matching" = pmm_imp
)

cm <- c(
  "(Intercept)" = "(Intercept)",
  "arm::rescale(monthly_income)" = "Monthly Income",
  "total_working_years" = "Total Working Years"
)

modelsummary::modelsummary(
  models, exponentiate = TRUE, output = "gt",
  coef_map = cm, gof_omit = "IC|Log|F|RMSE",
  title = "Logistic Regressions of Job Attrition"
) |>
  gt::tab_spanner(label = "Single Imputation", columns = 3:4) |>
  gt::tab_spanner(label = "Multiple Imputation", columns = 5)
Logistic Regressions of Job Attrition

| | No Imputation | Single Imputation: Mean | Single Imputation: Regression | Multiple Imputation: Predictive Mean Matching |
|---|---|---|---|---|
| (Intercept) | 0.301 (0.058) | 0.423 (0.062) | 0.268 (0.056) | 0.303 (0.072) |
| Monthly Income | 0.550 (0.154) | 0.865 (0.142) | 0.444 (0.134) | 0.564 (0.227) |
| Total Working Years | 0.950 (0.016) | 0.913 (0.014) | 0.959 (0.017) | 0.949 (0.019) |
| Num.Obs. | 1470 | 1470 | 1470 | 1470 |
| Num.Imp. | | | | 50 |
Conclusion
Not dealing with missing values is still a methodological choice, because most tools for fitting statistical models will handle missing values by default (usually through listwise deletion).
How missing values should be dealt with is dependent on the nature of the missingness (MCAR, MAR, MNAR).
Simple imputation is quick and easy but it may not be very robust, especially when imputing average values.
The best solution for missing values is to recover the true values, but failing that, consider multiple imputation.
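The first point in this conclusion can be checked directly in base R. This sketch uses the built-in `airquality` data and assumes R's default `na.action` option has not been changed.

```r
# No explicit handling of missing values is requested
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

getOption("na.action")  # the default rule applied to rows with missing values
nrow(airquality)        # rows in the data
nobs(fit)               # rows actually used to fit the model
length(fit$na.action)   # rows silently dropped
```

The model is fitted without any warning, but only on the complete cases, which is listwise deletion by default.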
Further Resources
{mice} - Multivariate Imputation by Chained Equations